Introduction

This project takes NBA scheduling data and answers basic questions, then explores trends and builds visualizations for schedules across a decade using ggplot2 and the plotly package. A model is then created to estimate how many wins each team has gained or lost due to scheduling variables. The data for this project is unavailable because it comes from a source that does not want it shared, so the accuracy of the answers will be hard to judge, but I hope the code and analysis demonstrate my data science skills in R. As a reminder, I am also proficient in SQL, Java, and Python.

Note:

Throughout this document, any season column represents the year each season started. For example, the 2015-16 season will be in the dataset as 2015. We may refer to a season by just this number (e.g. 2015) instead of the full text (e.g. 2015-16).

Answers

Part 1

Setup and Data

library(tidyverse)
library(readr)
# Note: you will likely have to change these paths to point at wherever your copies of the
# data files live. If the data sits in the same folder as this project, the relative paths
# below should work as-is.

#These lines read each CSV file into a data frame
locations <- read_csv("Data/locations.csv")
schedule <- read_csv("Data/schedule.csv")
schedule_24_partial <- read_csv("Data/schedule_24_partial.csv")
team_game_data <- read_csv("Data/team_game_data.csv")

Part 1 – Schedule Analysis

Question 1

QUESTION: How many times are the Thunder scheduled to play 4 games in 6 nights in the provided 80-game draft of the 2024-25 season schedule? (Note: stretches can overlap; the question is really "for how many games is that game the 4th played over the past 6 nights?")

# Here and for all future questions, feel free to add as many code chunks as you like. Do NOT put echo = F though, we'll want to see your code.

#This line creates a new df (data frame) containing only OKC's 2024-25 season games, arranged chronologically.
OKC24 <- schedule_24_partial%>%
  filter(team == "OKC")%>%
  arrange(gamedate)

#Tracks how many instances of 4 games in 6 days we accumulate running through the df.
tracker <- 0

#This loop counts how many 4-in-6 stretches OKC had in their 80-game draft schedule for
#the 2024-25 season. For each game i, I subtract its date from the date of game i + 3 (the
#4th game); if that difference is less than or equal to 5, those 4 games fall within 6
#nights. The threshold is 5 rather than 6 because the first game's night counts as day 0.
for(i in seq_along(OKC24$gamedate))
{
  gap <- as.numeric(OKC24$gamedate[i + 3] - OKC24$gamedate[i])
  if(!is.na(gap) && gap <= 5)
    {
      tracker <- tracker + 1
    }
}
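The same count can be cross-checked without an explicit loop: for a chronologically sorted date vector, the 4-in-6 test is just a lagged difference. A minimal base-R sketch on made-up dates (not the real schedule):

```r
# Count 4-in-6 stretches: games where the 4th game falls within 5 days of the 1st.
count_4in6 <- function(dates) {
  dates <- sort(dates)
  n <- length(dates)
  if (n < 4) return(0)
  # Difference between each game and the game 3 positions later
  sum(as.numeric(dates[4:n] - dates[1:(n - 3)]) <= 5)
}

# Hypothetical stretch: 4 games across 5 nights (Jan 1, 3, 4, 5)
d <- as.Date(c("2024-01-01", "2024-01-03", "2024-01-04", "2024-01-05"))
count_4in6(d)  # 1
```

Because the comparison is vectorized, overlapping stretches are counted naturally, one per qualifying 4th game, matching the loop's behavior.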

ANSWER 1:

26 4-in-6 stretches in OKC’s draft schedule.

Question 2

QUESTION: From 2014-15 to 2023-24, what is the average number of 4-in-6 stretches for a team in a season? Adjust each team/season to per-82 games before taking your final average.

#Arranges the schedule df alphabetically by team, then chronologically within each team, covering the 2014-15 through 2023-24 seasons
schedule <- schedule%>%
  arrange(team , gamedate)%>%
  mutate(is_4in6 = 0)

tracker2 <- 0

#Generates the total number of 4-in-6 stretches in the schedule df using the same logic as before.
#The team check guards against windows that would span two different teams' schedules.
for(b in seq_along(schedule$gamedate))
{
  gap <- as.numeric(schedule$gamedate[b + 3] - schedule$gamedate[b])
  if(!is.na(gap) && gap <= 5 && schedule$team[b + 3] == schedule$team[b])
    {
      tracker2 <- tracker2 + 1
      schedule$is_4in6[b + 3] <- 1
  }
}

#These variables convert the raw count into a per-82-games statistic
amount_of_seasons <- last(schedule$season) - schedule$season[1] + 1
amount_of_teams <- as.double(n_distinct(schedule$team))
tracker2per82 <- tracker2 / as.double(length(schedule$gamedate))
answer2 <- tracker2per82 * 82
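The per-82 adjustment itself is just a rate conversion: stretches per game, times 82. A quick sanity check with made-up numbers:

```r
# Convert a raw stretch count over some number of games to a per-82-games rate
per82 <- function(count, games) count / games * 82

per82(25, 82)   # 25: a full 82-game season is unchanged
per82(50, 164)  # 25: two full seasons halve the rate
```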

ANSWER 2:

25.3 4-in-6 stretches on average per team, per 82 games.

Question 3

QUESTION: Which of the 30 NBA teams has had the highest average number of 4-in-6 stretches between 2014-15 and 2023-24? Which team has had the lowest average? Adjust each team/season to per-82 games.

#Used to create a cut line in the new df, trimming away the large block of padding NA cells.
cutline <- 82 * amount_of_seasons

#This new df holds each team as an individual column with its gamedate data in the rows. I kept this in df format because I find it easier to view, and traversing the NA cells is not a problem for this task.
schedule_w_team <- schedule%>%
  mutate(row = row_number())%>%
  pivot_wider(id_cols = row , names_from = team , values_from = gamedate)%>%
  select(-row)%>%
  mutate(across(everything(), ~ 
    {
      #Collect the column's non-NA values, then pad the bottom with NAs so the
      #column keeps its original length with all real data shifted to the top.
      non_na <- .x[!is.na(.x)]
      c(non_na , rep(NA, length(.x) - length(non_na)))
    }
  ))%>%
  slice(1:cutline)
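An alternative to the pivot-and-pad reshaping is to split the long df by team and count per group. A base-R sketch on toy data, with an inline 4-in-6 counter (the toy schedule is invented for illustration):

```r
# Count 4-in-6 stretches per team from long-format data, without reshaping.
count_4in6 <- function(dates) {
  dates <- sort(dates)
  n <- length(dates)
  if (n < 4) return(0)
  sum(as.numeric(dates[4:n] - dates[1:(n - 3)]) <= 5)
}

# Toy long-format schedule: team A has one 4-in-6 stretch, team B has none
toy <- data.frame(
  team     = rep(c("A", "B"), each = 4),
  gamedate = as.Date(c("2024-01-01", "2024-01-03", "2024-01-04", "2024-01-05",
                       "2024-01-01", "2024-01-08", "2024-01-15", "2024-01-22"))
)

counts <- sapply(split(toy$gamedate, toy$team), count_4in6)
counts                    # A = 1, B = 0
names(which.max(counts))  # "A"
names(which.min(counts))  # "B"
```

With a named vector of per-team counts, which.max and which.min replace the manual highest/lowest bookkeeping.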

#Lowest team average 4 in 6

lowest2 <- 0
lowest1 <- Inf

#This loop walks each team's column, counts that team's 4-in-6 stretches across the 2014-15
#to 2023-24 seasons, and keeps track of whichever team currently has the fewest.
for(col_name in names(schedule_w_team))
{

  for(c in seq_along(schedule_w_team[[col_name]]))
  {
    gap <- as.numeric(schedule_w_team[[col_name]][c + 3] - schedule_w_team[[col_name]][c])
    if(!is.na(gap) && gap <= 5)
      {
        lowest2 <- lowest2 + 1
      }
  }
  
  if(lowest2 < lowest1)
    {
      lowest1 <- lowest2
      lowest_team <- col_name
    }
  lowest2 <- 0
}

#These variables compute the lowest team's per-82 average across the 2014-15 to 2023-24 seasons.
games_lowest <- schedule_w_team[[lowest_team]][!is.na(schedule_w_team[[lowest_team]])]
games_lowest <- as.double(length(games_lowest))
lowest_per82 <- lowest1/games_lowest * cutline/amount_of_seasons

#Highest team average 4 in 6

highest <- 0
highest1 <- 0

#This loop does the same walk, but keeps track of whichever team currently has the most
#4-in-6 stretches.
for(col_name2 in names(schedule_w_team))
{

  for(d in seq_along(schedule_w_team[[col_name2]]))
  {
    gap <- as.numeric(schedule_w_team[[col_name2]][d + 3] - schedule_w_team[[col_name2]][d])
    if(!is.na(gap) && gap <= 5)
      {
        highest <- highest + 1
      }
  }
  
  if(highest > highest1)
    {
      highest1 <- highest
      highest_team <- col_name2
    }
  highest <- 0
}

#These variables compute the highest team's per-82 average across the 2014-15 to 2023-24 seasons.
games_highest <- schedule_w_team[[highest_team]][!is.na(schedule_w_team[[highest_team]])]
games_highest <- as.double(length(games_highest))
highest_per82 <- highest1/games_highest * cutline/amount_of_seasons

#These variables calculate a margin of error for NBA teams' average 4-in-6 counts across the 2014-15 to 2023-24 seasons, which I use in the Question 4 answer
MOE_high <- highest_per82-answer2
MOE_low <- answer2 - lowest_per82

ANSWER 3:

  • Most 4-in-6 stretches on average: CHA : 28
  • Fewest 4-in-6 stretches on average: NYK : 22.3

Question 4

QUESTION: Is the difference between most and least from Q3 surprising, or do you expect that size difference is likely to be the result of chance?

ANSWER 4:

The difference between the most and the least from Q3 is not that surprising to me. Even if the NBA aimed to give every team the league-average number of 4-in-6 stretches per season (25.3481927), some spread would remain due to chance and outside factors. These factors could be tight scheduling caused by long-distance destinations, or the general time crunch the NBA faces in having all teams complete a certain number of games by the All-Star break and all 82 by the end of the season. The NBA is not perfect when it comes to scheduling, and managing this many teams makes giving each one an identical number of 4-in-6 stretches difficult. Either way, a spread of only +2.6745059 and -3.0408879 occurrences around a ten-year league-wide average suggests each team's schedule is fairly equitable.

Question 5

QUESTION: What was BKN’s defensive eFG% in the 2023-24 season? What was their defensive eFG% that season in situations where their opponent was on the second night of back-to-back?

#Creates a new variable, OffB2B, flagging whether the offensive team is on the second game of a back-to-back, and rearranges the rows and columns
team_game_data <- team_game_data%>%
  mutate(OffB2B = 0)%>%
  arrange(off_team , gamedate)%>%
  relocate(OffB2B , gamedate)

#This loop sets OffB2B to 1 when the offensive team is on the second game of a
#back-to-back, using the same lagged-difference logic as the 4-in-6 loops.
for(a in seq_along(team_game_data$gamedate))
  {
    gap <- as.numeric(team_game_data$gamedate[a + 1] - team_game_data$gamedate[a])
    if(!is.na(gap) && gap == 1)
    {
      team_game_data$OffB2B[a + 1] <-  1
    }
  }
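As a cross-check on the loop, the same second-night flag can be computed with a lagged date difference per team. A base-R sketch on toy data (the column roles mirror team_game_data, but the values here are invented):

```r
# Flag games played the day after the same team's previous game (second night of a B2B)
flag_b2b <- function(team, gamedate) {
  ord <- order(team, gamedate)                # sort by team, then chronologically
  out <- integer(length(team))
  # TRUE where the previous row (in sorted order) belongs to the same team
  prev_same <- c(FALSE, team[ord][-1] == team[ord][-length(team)])
  # Days since the previous row's game (NA for the very first row)
  gap <- c(NA, as.numeric(diff(gamedate[ord])))
  out[ord] <- as.integer(prev_same & !is.na(gap) & gap == 1)
  out                                          # returned in the original row order
}

toy_team <- c("BKN", "BKN", "BKN", "PHI")
toy_date <- as.Date(c("2024-01-01", "2024-01-02", "2024-01-05", "2024-01-03"))
flag_b2b(toy_team, toy_date)  # 0 1 0 0
```

Unlike the raw loop, the team comparison here explicitly prevents a flag from leaking across the boundary between one team's games and the next's.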

#Creates a new df with the columns selected below, filtered to games in the 2023-24 season where BKN is the defensive team
BKN23D <- team_game_data%>%
  select(gamedate , off_team, def_team, fgmade, fg3made, fgattempted, OffB2B)%>%
  filter(gamedate > as.Date("2023-08-06") & gamedate < as.Date("2024-08-06") & def_team == "BKN")

#Calculates defensive eFG%
eFG <- ((sum(BKN23D$fgmade) + .5 * sum(BKN23D$fg3made))/sum(BKN23D$fgattempted)) * 100

#Creates a new df including only games from BKN23D where the opponent was on the second night of a back to back
BKN23_O_B2B <- filter(BKN23D , OffB2B == 1)

#Calculates BKN's defensive eFG% in games where the opponent was on the second night of a back-to-back
eFG_B2B <- ((sum(BKN23_O_B2B$fgmade) + .5 * sum(BKN23_O_B2B$fg3made))/sum(BKN23_O_B2B$fgattempted)) * 100
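The eFG% formula used in both calculations can be factored into a small helper; the numbers below are made up for illustration:

```r
# Effective FG% = (FGM + 0.5 * 3PM) / FGA, expressed here as a percentage
efg_pct <- function(fgmade, fg3made, fgattempted) {
  (sum(fgmade) + 0.5 * sum(fg3made)) / sum(fgattempted) * 100
}

# 40 makes, 10 of them threes, on 80 attempts -> (40 + 5) / 80 = 56.25%
efg_pct(40, 10, 80)  # 56.25
```

Because the helper sums internally, it works on whole-season columns (as above) just as well as on single games.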

ANSWER 5:

  • BKN Defensive eFG%: 54.3%
  • When opponent on a B2B: 53.5%

Part 3 – Modeling

Question 9

QUESTION: Please estimate how many more/fewer regular season wins each team has had due to schedule-related factors from 2019-20 through 2023-24. Your final answer should have one number for each team, representing the total number of wins (not per 82, and not a per-season average). You may consider the on-court strength of the scheduled opponents as well as the impact of travel/schedule density. Please include the teams and estimates for the most helped and most hurt in the answer key.

If you fit a model to help answer this question, please write a paragraph explaining your model, and include a simple model diagnostic (eg a printed summary of a regression, a variable importance plot, etc).

#Creates a new df including only the 2019-20 through 2023-24 seasons and a select few variables
season19_23 <- schedule%>%
  filter(season >= 2019)%>%
  select(season, gamedate, team, opponent, home, win, is_b2b, miles_travelled)

#Creates a new variable flagging consecutive away games within the same season, to help identify travel clumps
season19_23 <- mutate(season19_23, is_travelling_b2b = 
                        ifelse(season == lag(season) & home == 0 & lag(home) == 0, 1, 0))

#Makes the first row of the is_travelling_b2b column 0 instead of NA
season19_23$is_travelling_b2b[is.na(season19_23$is_travelling_b2b)] <- 0
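A variant of this flag that also guards against lag() reaching across team boundaries could look like the following base-R sketch (toy data, not the real schedule):

```r
# Flag consecutive away games within the same team and season
travel_b2b_flag <- function(team, season, home) {
  prev_team   <- c(NA, head(team, -1))
  prev_season <- c(NA, head(season, -1))
  prev_home   <- c(NA, head(home, -1))
  flag <- team == prev_team & season == prev_season & home == 0 & prev_home == 0
  flag[is.na(flag)] <- FALSE   # the first row has no previous game
  as.integer(flag)
}

# ATL plays away, away, home; BOS's first game must not inherit ATL's lag
toy_team   <- c("ATL", "ATL", "ATL", "BOS")
toy_season <- c(2019, 2019, 2019, 2019)
toy_home   <- c(0, 0, 1, 0)
travel_b2b_flag(toy_team, toy_season, toy_home)  # 0 1 0 0
```

Note this flags consecutive *scheduled* away games; like the mutate above, it does not check that the games fall on consecutive nights.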

#Creates the linear model
model <- lm(win ~ is_travelling_b2b  + miles_travelled + is_b2b, data = season19_23)
summary(model)
## 
## Call:
## lm(formula = win ~ is_travelling_b2b + miles_travelled + is_b2b, 
##     data = season19_23)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54917 -0.49517  0.03808  0.48905  0.62393 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5.492e-01  7.334e-03  74.879  < 2e-16 ***
## is_travelling_b2b -5.759e-02  1.035e-02  -5.564 2.70e-08 ***
## miles_travelled   -4.109e-05  8.485e-06  -4.843 1.29e-06 ***
## is_b2b            -6.465e-02  1.236e-02  -5.230 1.72e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4982 on 11654 degrees of freedom
## Multiple R-squared:  0.007346,   Adjusted R-squared:  0.00709 
## F-statistic: 28.75 on 3 and 11654 DF,  p-value: < 2.2e-16
#Contains the VIF function
library(car)

#Shows the Variance Inflation Factors
vif(model)
## is_travelling_b2b   miles_travelled            is_b2b 
##          1.015400          1.022766          1.027152

Model explanation: wins are predicted by whether a team is playing consecutive away games (to capture travel clumps), miles travelled (to capture long flights), and whether the game is the second of a back-to-back — all variables that could create fatigue and lead to losses.

Model analysis: I used a linear model to measure how much of the variance in wins could be explained by scheduling variables. After experimenting with several factors, such as whether the game was part of a series or was a 4-in-6 game, I settled on the specification in which every explanatory variable was statistically significant (p < .05): miles travelled, whether the team was on consecutive away games, and whether the game was a back-to-back. The model only captures an R² of a little under 1%, though the overall fit is still significant (F-statistic p < .05). At first I was displeased with that, but after working through numerous options — excluding strength of opponent, because I felt that variable goes beyond strictly scheduling — I came to be satisfied with the model. There is certainly room for improvement, but I found three significant variables (and learned a completely new function, distHaversine, to calculate the distance between two coordinates) that affect an NBA team's performance to a measurable extent. The low R² also makes sense: scheduling factors are not the final word on whether a team wins on a given night. That is decided by points scored and points allowed, along with other game-performance statistics. So my conclusions are weak in one sense and strong in another: the factors are significant and show no collinearity (VIF < 10 for all), but scheduling alone does not go far in determining whether a team wins a given game.
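For reference, the great-circle math behind distHaversine can be written out directly. A base-R sketch (coordinates in degrees, radius in miles; the airport coordinates are approximate):

```r
# Haversine great-circle distance between two lon/lat points, in miles
haversine_miles <- function(lon1, lat1, lon2, lat2, r = 3958.8) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

# Roughly LAX to JFK; the true great-circle distance is about 2,470 miles
haversine_miles(-118.41, 33.94, -73.78, 40.64)
```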

#Creates a new column, win_factor, holding the model's predicted win probability for each game under the model created above
season19_23 <- season19_23%>%
  mutate(win_factor = coef(model)["(Intercept)"] + is_travelling_b2b*coef(model)["is_travelling_b2b"] + miles_travelled*coef(model)["miles_travelled"]+ is_b2b * coef(model)["is_b2b"])
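This hand-built linear combination is exactly the model's fitted value, so it should agree with predict(). A self-contained check on a toy regression (the data here is simulated, not the schedule data):

```r
# For an lm fit, intercept + sum(coefficient * predictor) equals predict()
set.seed(1)
toy <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
toy$y <- 0.5 + 0.3 * toy$x1 - 0.2 * toy$x2 + rnorm(20, sd = 0.1)
fit <- lm(y ~ x1 + x2, data = toy)

by_hand <- coef(fit)["(Intercept)"] +
  toy$x1 * coef(fit)["x1"] +
  toy$x2 * coef(fit)["x2"]
all.equal(unname(by_hand), unname(predict(fit)))  # TRUE
```

Using predict(model, season19_23) would therefore give the same win_factor column in one call.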

#These lines are used to help organize future dfs
amount_of_seasons2 <- last(season19_23$season) - season19_23$season[1] + 1
cutline2 <- amount_of_seasons2 * 82

#This line creates a new df that puts each win_factor in chronological order with its respective team.
season19_23_wins <- season19_23%>%
  mutate(row = row_number())%>%
  pivot_wider(id_cols = row , names_from = team , values_from = win_factor)%>%
  select(-row)%>%
  mutate(across(everything(), ~ 
    {
      #Collect the column's non-NA values, then pad the bottom with NAs so the
      #column keeps its original length with all real data shifted to the top.
      non_na <- .x[!is.na(.x)]
      c(non_na , rep(NA, length(.x) - length(non_na)))
    }
  ))%>%
  slice(1:cutline2)

#Most hurt

wins_lowest <- Inf
wins_lowest2 <- 0

#This loop sums (win_factor - .5) over each team's games and tracks the team with the lowest total. Subtracting .5 works because, absent any outside bias, each game has two outcomes with probability .5 each, so win_factor - .5 is the win probability gained or lost due to the scheduling factors reflected in the model.
for(col_name3 in names(season19_23_wins))
{
  
  for(e in seq_along(season19_23_wins[[col_name3]]))
  {
      if(!is.na(season19_23_wins[[col_name3]][e]))
      {
        wins_lowest2 <- wins_lowest2 + season19_23_wins[[col_name3]][e] - .5
      }
  }
  
  if(wins_lowest2 < wins_lowest)
    {
      wins_lowest <- wins_lowest2
      most_hurt <- col_name3
    }
  wins_lowest2 <- 0
}

#Most helped

wins_highest <- 0
wins_highest2 <- 0

#This loop does the same as the most-hurt loop but finds the most helped team.
for(col_name4 in names(season19_23_wins))
{
  
  for(f in seq_along(season19_23_wins[[col_name4]]))
  {
      if(!is.na(season19_23_wins[[col_name4]][f]))
      {
        wins_highest2 <- wins_highest2 + season19_23_wins[[col_name4]][f] - .5
      }
  }
  
  if(wins_highest2 > wins_highest)
    {
      wins_highest <- wins_highest2
      most_helped <- col_name4
    }
  wins_highest2 <- 0
}

#Sums (win_factor - .5) down each column of season19_23_wins, ignoring NAs, and stores the totals in team_wins, a named vector
team_wins <- sapply(season19_23_wins, function(col) sum(col - .5, na.rm = TRUE))

#Creates a tidy df called wins showing each team's wins gained or lost under the model
wins <- data.frame(team = names(team_wins), wins_gained = team_wins)

#prints wins
wins
##     team wins_gained
## ATL  ATL  1.18196970
## BKN  BKN  0.05442204
## BOS  BOS -0.17343132
## CHA  CHA  1.01073554
## CHI  CHI  1.19807812
## CLE  CLE  1.74243218
## DAL  DAL  0.28573307
## DEN  DEN  0.13571990
## DET  DET  1.63174966
## GSW  GSW -1.68274724
## HOU  HOU -0.13540172
## IND  IND  1.79418785
## LAC  LAC -1.18413977
## LAL  LAL -0.44134162
## MEM  MEM  0.07741422
## MIA  MIA -1.08144629
## MIL  MIL  1.22759918
## MIN  MIN -0.65929802
## NOP  NOP -0.76703029
## NYK  NYK  0.77914784
## OKC  OKC  0.17595639
## ORL  ORL -0.68255543
## PHI  PHI  0.80177007
## PHX  PHX -1.10738646
## POR  POR -2.62059396
## SAC  SAC -2.07083638
## SAS  SAS -0.66337345
## TOR  TOR  0.69591682
## UTA  UTA -0.57936116
## WAS  WAS  1.05611052
#I computed the most hurt and most helped with the loops above for reproducibility, instead of just reading them off this table.
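Given the team_wins vector, the two search loops could also be collapsed with which.max and which.min. A sketch on toy totals (the numbers below are illustrative, not the model's output):

```r
# Identify most helped / most hurt directly from a named vector of win totals
team_wins_toy <- c(IND = 1.79, POR = -2.62, BOS = -0.17)

names(which.max(team_wins_toy))  # "IND"  (most helped)
names(which.min(team_wins_toy))  # "POR"  (most hurt)
```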

ANSWER 9:

  • Most Helped by Schedule: IND: 1.79
  • Most Hurt by Schedule: POR: -2.62